 natural language instruction


Supplementary Materials for On the Effects of Data Scale on Computer Control Agents

Neural Information Processing Systems

For completeness, in the following we include a datasheet based on the format of [1]. For what purpose was the dataset created? Was there a specific task in mind? Who created the dataset (e.g., which team, research group) and on behalf of which entity? What do the instances that comprise the dataset represent (e.g., documents, photos, people, ...)? The dataset contains episodes of human demonstrations for mobile device control. How many instances are there in total (of each type, if appropriate)?








AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments

Neural Information Processing Systems

Similar to audio-visual navigation tasks, the goal of our embodied agent is to localize an audio event via navigating the 3D visual world; however, the agent may also seek help from a human (oracle), where the assistance is provided in free-form natural language.


AndroidInTheWild: A Large-Scale Dataset For Android Device Control

Neural Information Processing Systems

There is a growing interest in device-control systems that can interpret human natural language instructions and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AitW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. This dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance, and, instead of simple UI element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize our dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report performance across the dataset.


Introspective Planning: Aligning Robots' Uncertainty with Inherent Task Ambiguity

Neural Information Processing Systems

Large language models (LLMs) exhibit advanced reasoning skills, enabling robots to comprehend natural language instructions and strategically plan high-level actions through proper grounding. However, LLM hallucination may result in robots confidently executing plans that are misaligned with user goals or even unsafe in critical scenarios. Additionally, inherent ambiguity in natural language instructions can introduce uncertainty into the LLM's reasoning and planning. We propose introspective planning, a systematic approach that guides LLMs to refine their own uncertainty in alignment with inherent task ambiguity. Our approach constructs a knowledge base containing introspective reasoning examples as post-hoc rationalizations of human-selected safe and compliant plans, which are retrieved during deployment. Evaluations on three tasks, including a new safe mobile manipulation benchmark, indicate that introspection substantially improves both compliance and safety over state-of-the-art LLM-based planning methods. Additionally, we empirically show that introspective planning, in combination with conformal prediction, achieves tighter confidence bounds, maintaining statistical success guarantees while minimizing unnecessary user clarification requests.